BONN: Bayesian Optimized Binary Neural Network

Algorithm 6 Pruning 1-bit CNNs with Bayesian learning

Input: The pre-trained 1-bit CNN model with parameters $K$, the reconstruction vector $w$, the learning rate $\eta$, regularization parameters $\lambda$ and $\theta$, variance $\nu$, convergence rate $\gamma$, and the training dataset.
Output: The pruned BONN with updated $K$, $w$, $\mu$, $\sigma$, $c_m$, $\sigma_m$.

1: repeat
2:   // Forward propagation
3:   for $l = 1$ to $L$ do
4:     $K^l_{i,j} = (1-\gamma)K^l_{i,j} + \gamma \bar{K}^l_j$;
5:     $\hat{k}^l_i = \bar{w}^l\,\mathrm{sign}(k^l_i),\ \forall i$; // Each element of $w^l$ is replaced by the average of all elements, $\bar{w}^l$
6:     Perform activation binarization; // Using the sign function
7:     Perform 2D convolution with $\hat{k}^l_i,\ \forall i$;
8:   end for
9:   // Backward propagation
10:  Compute $\delta_{\hat{k}^l_i} = \partial L_S / \partial \hat{k}^l_i,\ \forall l, i$;
11:  for $l = L$ to $1$ do
12:    Calculate $\delta_{k^l_i}$, $\delta_{w^l}$, $\delta_{\mu^l_i}$, $\delta_{\sigma^l_i}$; // Using Eqs. 3.115–3.120
13:    Update parameters $k^l_i$, $w^l$, $\mu^l_i$, $\sigma^l_i$ using SGD;
14:  end for
15:  Update $c_m$, $\sigma_m$;
16: until filters in the same group are similar enough
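To make the forward steps concrete, the following is a minimal PyTorch sketch of lines 4–7 of Algorithm 6 (our illustration, not the authors' released code; the function name, tensor shapes, and grouping representation are assumptions): each filter is pulled toward its group mean at rate $\gamma$, binarized by the sign function, scaled by the layer-wise average $\bar{w}^l$, and applied in a 2D convolution.

```python
import torch
import torch.nn.functional as F

def prune_aware_forward(x, K, w, group_ids, gamma):
    """One forward step of Algorithm 6 (lines 4-7) for a single layer.

    x         : input activations, shape (N, C_in, H, W)
    K         : real-valued filters, shape (C_out, C_in, kH, kW)
    w         : reconstruction vector, shape (C_out,)
    group_ids : LongTensor of shape (C_out,) assigning each filter to a group j
    gamma     : convergence rate toward the group mean
    """
    with torch.no_grad():
        # Line 4: K^l_{i,j} <- (1 - gamma) * K^l_{i,j} + gamma * group mean
        for j in group_ids.unique():
            idx = (group_ids == j).nonzero(as_tuple=True)[0]
            K[idx] = (1 - gamma) * K[idx] + gamma * K[idx].mean(dim=0, keepdim=True)

    # Line 5: binarize filters and scale by the layer-wise average of w
    w_bar = w.mean()               # each element of w^l replaced by the average
    K_hat = w_bar * torch.sign(K)  # \hat{k}^l_i = \bar{w}^l sign(k^l_i)

    # Line 6: binarize activations with the sign function
    # (training would need a straight-through estimator, omitted here)
    x_bin = torch.sign(x)

    # Line 7: 2D convolution with the binarized filters
    return F.conv2d(x_bin, K_hat, padding=K.shape[-1] // 2)
```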

Updating $K^l_{i,j}$: In pruning, we aim to have the filters converge gradually to their group mean, so we replace each filter $K^l_{i,j}$ with its corresponding mean $\bar{K}^l_j$. The gradient with respect to each filter is computed through the mean as follows:

$$
\begin{aligned}
\frac{\partial L}{\partial K^l_{i,j}}
&= \frac{\partial L_S}{\partial K^l_{i,j}} + \frac{\partial L_B}{\partial K^l_{i,j}} + \frac{\partial L_P}{\partial K^l_{i,j}} \\
&= \frac{\partial L_S}{\partial \bar{K}^l_j}\frac{\partial \bar{K}^l_j}{\partial K^l_{i,j}}
 + \frac{\partial L_B}{\partial \bar{K}^l_j}\frac{\partial \bar{K}^l_j}{\partial K^l_{i,j}}
 + \frac{\partial L_P}{\partial K^l_{i,j}} \\
&= \frac{1}{I_j}\left(\frac{\partial L_S}{\partial \bar{K}^l_j} + \frac{\partial L_B}{\partial \bar{K}^l_j}\right)
 + 2\left(K^l_{i,j} - \bar{K}^l_j\right)
 + 2\nu\left(\Psi^l_j\right)^{-1}\left(K^l_{i,j} - \bar{K}^l_j\right),
\end{aligned}
\tag{3.120}
$$

where $\bar{K}^l_j = \frac{1}{I_j}\sum_{i=1}^{I_j} K^l_{i,j}$ is the group mean used to update the filters in group $j$.
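As a sanity check on Eq. (3.120), the snippet below evaluates its right-hand side for all $I_j$ filters of one group at once (a minimal sketch: the function name `bonn_filter_grad` is ours, and we assume a diagonal $\Psi^l_j$ so that $(\Psi^l_j)^{-1}$ acts elementwise; the full covariance case would instead apply a matrix product over vectorized filters).

```python
import torch

def bonn_filter_grad(g_bar_S, g_bar_B, K_group, Psi_inv, nu):
    """Gradient of Eq. (3.120) for every filter K^l_{i,j} in one group j.

    g_bar_S, g_bar_B : dL_S/dK_bar and dL_B/dK_bar, shape (C_in, kH, kW)
    K_group          : the I_j filters of group j, shape (I_j, C_in, kH, kW)
    Psi_inv          : (Psi^l_j)^{-1}, assumed diagonal, so it is stored
                       elementwise with shape broadcastable to K_group
    nu               : variance hyperparameter
    """
    I_j = K_group.shape[0]
    K_bar = K_group.mean(dim=0, keepdim=True)   # \bar{K}^l_j
    diff = K_group - K_bar                      # K^l_{i,j} - \bar{K}^l_j
    shared = (g_bar_S + g_bar_B) / I_j          # (1/I_j)(dL_S/dK_bar + dL_B/dK_bar)
    return shared + 2.0 * diff + 2.0 * nu * Psi_inv * diff
```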

To prune redundant filters, we keep only the first filter in each group and remove the others. However, this operation changes the distribution of the input channels of the batch normalization layer and causes a dimension mismatch with the next convolutional layer. To solve this problem, we keep the batch normalization layer at its original size and set the entries corresponding to the removed filters to zero. In this way, the removed information is retained to the greatest extent. In summary, the proposed method is end-to-end trainable. The learning procedure is detailed in Algorithms 5 and 6.
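A dimension-consistent way to realize this zeroing in PyTorch is sketched below (our illustration, not the authors' code; `keep_mask`, which marks the first filter of each group, is a hypothetical input): rather than physically deleting rows, the redundant filters and the corresponding batch-norm entries are set to zero, so every tensor keeps its original size and the next convolutional layer sees an unchanged input dimension.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def zero_out_pruned(conv: nn.Conv2d, bn: nn.BatchNorm2d, keep_mask: torch.Tensor):
    """Zero the removed filters and the matching batch-norm entries.

    conv      : convolution whose output channels are being pruned
    bn        : the BatchNorm2d layer that follows `conv`
    keep_mask : BoolTensor of shape (C_out,); True = first filter of each
                group (kept), False = redundant filter (removed)
    """
    removed = ~keep_mask
    # Removed filters now produce all-zero feature maps.
    conv.weight[removed] = 0.0
    if conv.bias is not None:
        conv.bias[removed] = 0.0
    # The batch-norm layer keeps its original size; only the entries that
    # correspond to removed filters are set to zero, so the input dimension
    # of the next convolutional layer is unchanged.
    bn.weight[removed] = 0.0
    bn.bias[removed] = 0.0
    bn.running_mean[removed] = 0.0
    bn.running_var[removed] = 1.0  # kept at 1, not 0, so normalization stays
                                   # numerically safe; the zeroed affine
                                   # parameters already force a zero output
```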